\section{Introduction}\label{sec:intro}

The human ability to understand language is \emph{general}, \emph{flexible}, and \emph{robust}. 
In contrast, most NLU models above the word level are designed for a specific task and struggle with out-of-domain data. 
If we aspire to develop models with understanding beyond the detection of superficial correspondences between inputs and outputs, 
then it is critical to develop a more unified model that can learn to execute a range of different linguistic tasks in different domains.

To facilitate research in this direction, we present the General Language Understanding Evaluation benchmark (GLUE, \href{https://gluebenchmark.com}{\tt gluebenchmark.com}): a collection of NLU tasks including question answering, sentiment analysis, and textual entailment, and an associated online platform for model evaluation, comparison, and analysis.
GLUE does not place any constraints on model architecture beyond the ability to process single-sentence and sentence-pair inputs and to make corresponding predictions. 
For some GLUE tasks, training data is plentiful, but for others it is limited or fails to match the genre of the test set. GLUE therefore favors models that can learn to represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks. 
None of the datasets in GLUE were created from scratch for the benchmark; we rely on preexisting datasets because they have been implicitly vetted by the community to be challenging and interesting.
Four of the datasets feature privately held test data, which we use to ensure that the benchmark is used fairly.
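As a rough illustration of this minimal contract, the sketch below shows one way a submission could be structured; the class and method names are our own invention for exposition, not an official GLUE API:

\begin{verbatim}
from typing import List, Optional

class GLUESubmission:
    # Hypothetical interface (not an official GLUE API): any model
    # that maps single sentences or sentence pairs to predictions can
    # be evaluated, regardless of its internal architecture.
    def predict(self, text_a: List[str],
                text_b: Optional[List[str]] = None) -> List[int]:
        # text_b is None for single-sentence tasks (e.g., sentiment)
        # and holds the second sentence for pair tasks (e.g., textual
        # entailment); the return value is a label index per example.
        raise NotImplementedError
\end{verbatim}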

To understand the types of knowledge learned by models and to encourage linguistically meaningful solution strategies, GLUE also includes a set of hand-crafted analysis examples for probing trained models. 
This dataset is designed to highlight common challenges, such as the use of world knowledge, logical operators, and lexical entailments, that models must handle to robustly solve the tasks.
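To give a flavor of the phenomena involved, the constructed pairs below illustrate the style of such diagnostics; they are our own illustrative examples, not actual entries from the dataset:

\begin{verbatim}
# Illustrative diagnostic-style premise/hypothesis pairs
# (hypothetical examples, not drawn from the GLUE diagnostic set).
examples = [
    # Negation flips the entailment relation.
    ("The cat sat on the mat.",
     "The cat did not sit on the mat.", "contradiction"),
    # Lexical entailment: "dog" entails "animal".
    ("A dog is sleeping.", "An animal is sleeping.", "entailment"),
    # World knowledge: requires knowing Paris is in France.
    ("She flew to Paris.", "She flew to France.", "entailment"),
]
\end{verbatim}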

To better understand the challenges posed by GLUE, we conduct experiments with simple baselines and state-of-the-art sentence representation models.
We find that unified multi-task trained models slightly outperform comparable models trained on each task separately.
Our best multi-task model makes use of ELMo \citep{peters2018deep}, a recently proposed pre-training technique.
However, even this model achieves a fairly low absolute score, indicating substantial room for improvement in general NLU systems.
Analysis with our diagnostic dataset reveals that our baseline models deal well with strong lexical signals but struggle with deeper logical structure.
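To make the multi-task setup concrete, the following PyTorch sketch (with illustrative names; our actual baselines differ in detail) pairs a shared sentence encoder with one lightweight classification head per task:

\begin{verbatim}
import torch.nn as nn

class MultiTaskModel(nn.Module):
    # Sketch: one shared encoder, one output head per GLUE task.
    def __init__(self, encoder, hidden_dim, num_labels):
        super().__init__()
        self.encoder = encoder  # e.g., a BiLSTM over word embeddings
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, n)
             for task, n in num_labels.items()})

    def forward(self, inputs, task):
        # Encode once, then classify with the task-specific head.
        return self.heads[task](self.encoder(inputs))
\end{verbatim}

During training, batches are drawn from the tasks in alternation, so the shared encoder receives gradients from all tasks while each head sees only its own; this sharing is what allows knowledge learned on data-rich tasks to transfer to data-poor ones.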

In summary, we offer: (i) A suite of nine sentence- or sentence-pair NLU tasks, built on established annotated datasets and selected to cover a diverse range of text genres, dataset sizes, and degrees of difficulty. (ii) An online evaluation platform and leaderboard, based primarily on privately held test data. The platform is model-agnostic, and can evaluate any method capable of producing results on all nine tasks. (iii) An expert-constructed diagnostic evaluation dataset. (iv) Baseline results for several major existing approaches to sentence representation learning.
